Abstract


The generation of recognition programs by hand is a time-consuming, labor-intensive task that
typically results in a special purpose program for the recognition of a single object or a small set
of objects. Recent work in automatic code generation has demonstrated the feasibility of
automatically generating object recognition programs from CAD-based descriptions of objects. Many
of the programs which perform automatic code generation employ a common paradigm of utilizing
explicit object and sensor models to predict object appearances; we refer to the paradigm as
appearance-based vision, and refer to the programs as vision algorithm compilers (VACs). A
CAD-like object model augmented with sensor-specific information like color and reflectance, in
conjunction with a sensor model, provides all the information needed to predict the appearance
of an object under any specified set of viewing conditions. Appearances, characterized in terms
of feature values, can be predicted in two ways: analytically, or synthetically. In relatively simple
domains, feature values can be analytically determined from model information. However, in
complex domains, the analytic prediction method is impractical. An alternative method for
appearance prediction is to use an appearance simulator to generate synthetic images of objects
which can then be processed to extract feature values. In this paper, we discuss the paradigm of
appearance-based vision and present in detail two specific VACs: one that computes feature
values analytically, and a second that utilizes an appearance simulator to synthesize sample images.

1 Introduction

The generation of object recognition programs by hand is a time-consuming, labor-intensive process that typically
results in a special purpose program for the recognition of a single object or a small set of objects. The reason for this
lies in the design methodology: on the basis of a representative set of sample images, the designer selects a set of
image features and specifies the procedure to be used in matching image features to object features. Feature selection
requires application of many different operators to the sample images in order to determine the best feature set.
Optimization of both feature extraction operators and the matching procedure requires extensive experimentation.
The entire process requires a highly skilled and motivated designer. While the result is often an efficient, robust
solution to the problem, its overall cost is so high as to be prohibitive for many applications.

Recently, driven in part by an effort to make computer vision systems more economically feasible, research has been
conducted towards the goal of automatically generating recognition code from a CAD-based description of an object.
Most current industrial parts are designed and manufactured using computer-aided tools, so CAD descriptions exist
for most parts; automatic generation of recognition code from the same model information used for design and
manufacture would be an efficient, cost-effective approach. A number of programs for automatic generation of object
recognition code have been written, and many of these programs employ a common paradigm in which explicit object
and sensor models are used to predict object appearances. We refer to the paradigm as appearance-based vision, and
programs which generate object recognition programs are called vision algorithm compilers, or VACs.

Appearance-based vision represents an extension to the familiar paradigm of model-based vision. Model-based
vision defines an execution-time methodology of matching observed image features to model features, but does not
address the issue of defining a methodology for selecting either features or matching procedures. Appearance-based
vision systems employ model-based matching during the execution-time recognition phase, but also employ a
characteristic methodology during the off-line compilation phase, during which features are selected and processing
strategies determined.

In principle, a CAD-like object model, augmented with sensor-specific information like surface color, roughness, and
reflectance, can be used in conjunction with a sensor model to predict the appearance of the object under any
specified set of viewing conditions. For example, knowing the color and reflectance of a polygonal patch permits a
complete determination of its variation in appearance with respect to a video camera and a fixed light source. VACs can
predict object appearances in two different ways: analytically, or synthetically.

In relatively simple domains, feature values can be analytically determined from model information. For example, the
set of visible edges of a polyhedron with Lambertian surfaces is a straightforward computation. Similarly, the
collection of visible surfaces with respect to a laser sensor can be easily computed. As objects and their properties grow in
complexity, however, effects such as self-shadowing and inter-reflection become more important, but are difficult to
incorporate in an analysis. Analytic prediction of appearances thus becomes impractical for complex domains.

An alternative to analytic prediction of appearances is the use of an appearance simulator. An appearance simulator
generates synthetic images of objects under specific viewing conditions, with respect to a given sensor. An
appearance simulator can be used to generate a representative collection of sample images, which can then be processed and
analyzed to extract the feature values that characterize object appearances. Thus, a VAC utilizing an appearance
simulator is a computational implementation of the traditional hand-generation approach to building object recognition
systems.

In this paper, we discuss the paradigm of appearance-based vision. In the next section, we review the state-of-the-art
in appearance-based vision and the automatic generation of object recognition programs. In the course of the review,
the defining characteristics of appearance-based vision systems will be noted. Following the review, we will present
in detail two VACs which typify appearance-based systems. In section 3, we present a VAC that employs analytic
prediction of appearances, and the advantages and limitations of this approach are discussed. Then, in section 4, we
present a VAC that utilizes an appearance simulator. A brief summary concludes the paper.

2 The Paradigm of Appearance-Based Vision

The history of computer vision research has largely been a study in making vision systems work. Little attention has
been paid to the study of how to design and build application systems. For example, the dominant paradigm in
computer vision is that of model-based vision. Briefly, the model-based paradigm can be characterized as
hypothesize-predict-verify: given a collection of image features, hypothesize a match of an image feature to a model feature; use
the hypothesized match to predict the image locations of other model features; verify the predictions and update the
hypothesis. The paradigm does not specify how to select the features to use, or how to perform the matching. The
paradigm defines an approach to execution-time processing, rather than an approach to system building.
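
The hypothesize-predict-verify cycle can be illustrated with a minimal sketch. It is not taken from any of the systems discussed in this paper: the 2D point features, the translation-only pose, the tolerance `tol`, and the function name `hypothesize_predict_verify` are all assumptions made for the example.

```python
from itertools import product

def hypothesize_predict_verify(model_pts, image_pts, tol=0.5):
    """Return (translation, score) of the best-supported hypothesis.

    Assumes (hypothetically) that the pose is a pure 2D translation, so a
    single model-to-image feature match fixes the pose completely.
    """
    best = (None, 0)
    for (mx, my), (ix, iy) in product(model_pts, image_pts):
        # Hypothesize: model feature (mx, my) matches image feature (ix, iy).
        tx, ty = ix - mx, iy - my
        # Predict: image locations of all model features under this pose.
        predicted = [(x + tx, y + ty) for x, y in model_pts]
        # Verify: count predictions that land near some image feature.
        score = sum(
            any(abs(px - qx) <= tol and abs(py - qy) <= tol
                for qx, qy in image_pts)
            for px, py in predicted
        )
        if score > best[1]:
            best = ((tx, ty), score)
    return best
```

Note that the sketch fixes the feature type and the matching rule by hand, which is exactly the design work that the paradigm itself leaves unspecified.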

Typically, a model-based vision system is built by hand through a time-consuming, labor-intensive, experimental
process. The designer specifies a task scenario and obtains sample images. The images are repeatedly processed using a
variety of operators until the designer has selected a set of features that is adequate for the task. The designer then
determines appropriate values to characterize the features, implements a matching procedure, and experimentally
“tunes” the procedure to perform optimally. The resulting programs are often extremely efficient and competent for
the intended application, but are difficult to extend or modify. Moreover, every model-based system takes about the
same amount of effort to develop.

Because every computer vision system is essentially a custom solution to a specific problem, a typical computer
vision system is expensive to develop and install, and is capable of recognizing only a single part or a small number
of parts under very special conditions. Modifications to existing systems are difficult to make, and the cost to develop
a new system is as high as that of the first system. Clearly this is an unacceptable situation. For computer vision
systems to be practical, they must be cost-effective. This means, among other things, that a computer vision system must
be economical to develop, install, and modify.

Appearance-based vision addresses the problem of building cost-effective computer vision systems; it specifies a
methodology for the automatic generation of object recognition programs. Appearance-based vision is an extension
of the model-based paradigm that formalizes and automates the design process. Appearance-based vision can be
characterized as an automated process of analyzing the appearances of objects under specified observation
conditions, followed by the automatic generation of model-based object recognition programs based on the preceding
analysis. An appearance-based system is called a vision algorithm compiler, or VAC.

The appearance of an object is a function of both the properties of the object and the properties of the sensor system.
Object models must be more sophisticated than conventional CAD models, which only represent object
geometry. Models for appearance-based vision must include any information that contributes to the appearance of an
object with respect to a sensor. For example, 3D geometry is necessary to predict apparent shape, but must be
augmented with information about surface roughness, reflectance, transmittance, and color. Since the appearance of an
object varies with respect to the sensor, sensor models must also be specified. A sensor model must include
information about the relative geometry of the illuminant and detectors, as well as information about the features detectable
by the sensor.

One characteristic of appearance-based vision is that both objects and sensors are explicitly modeled, and therefore
exchangeable. Hence, a given VAC can generate object recognition code for many different objects using the same
sensor model, or the set of objects can be fixed and the sensor models varied. Appearance-based systems embody an
abstract principle of appearance prediction based on the use of accurate models.

The task of object recognition is composed of two subtasks: object identification, in which objects in the scene are
identified, and object localization, in which the exact pose (position and orientation) of an identified object is
computed. A given object recognition task may consist of either or both subtasks, and VACs have been constructed to
solve all the different combinations of identification and localization.

A VAC incorporates a two stage approach to object recognition. The first stage is executed off-line and consists of
performing analysis of predicted object appearances and the generation of object recognition code. The second stage
is executed on-line, and consists of applying the previously generated code to input images. The first stage is executed
only once for a given object recognition task, and can be relatively expensive. The second stage is executed many
times, and must be both fast and cost-effective. The high cost of the first stage is amortized over a large number of
executions of the second stage.
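
The amortization argument can be made concrete with a small calculation. The sketch below is illustrative only: the cost units and the function names `amortized_cost` and `break_even_runs` are assumptions, and the "baseline" is a hypothetical recognition program that needs no off-line stage but is slower on-line.

```python
def amortized_cost(offline_cost, online_cost, runs):
    """Average cost per recognition when compilation is paid only once."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return offline_cost / runs + online_cost

def break_even_runs(offline_cost, online_cost, baseline_online_cost):
    """Smallest number of runs at which the compiled program is cheaper in
    total than a baseline that has no off-line stage."""
    saving = baseline_online_cost - online_cost
    if saving <= 0:
        return None  # the compiled program never pays for itself
    # Cheaper once runs * saving exceeds the one-time off-line cost.
    return int(offline_cost // saving) + 1
```

As the number of runs grows, the amortized cost approaches the on-line cost alone, which is why the second stage is the one worth optimizing.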

During the first, off-line stage of processing, the appearances of the object are predicted over the expected range of
viewpoints. The predicted appearances are analyzed to determine the set of features that are useful for recognition.
Frequently, identification and localization are performed most efficiently using different feature sets. Once the feature
set is determined, representative values for the features are determined and compiled into a recognition strategy.

The second, on-line stage of an appearance-based system is nothing more than the run-time execution of the
generated strategy. Since there are many different computational strategies, the on-line stage varies considerably between
systems. The important principle is that extensive off-line analysis can be used to make the on-line stage as efficient,
robust, and cost-effective as possible.

In the next subsection, we present an historical overview of research on appearance-based vision. Then, building on
the historical perspective, we enumerate and elaborate on the commonalities between the systems; it is this set of
common characteristics that defines the paradigm of appearance-based vision.

2.1  Historical  Perspective 

Goad [9] presented an early version of a VAC. He noted that the computational activity of an application vision
system could be split up into two stages: an analysis stage, in which useful information about the task can be compiled;
and an execution stage in which the compiled information is utilized to perform object recognition. Moreover, the
compilation stage is performed off-line only once, at considerable computational expense, while the execution stage
is executed on-line many times, and should be optimized to be as rapid as possible. The computational expense of the
off-line stage is then offset by the savings realized by the repeated execution of the optimized on-line stage.

In Goad’s system, an object is described by a list of edges and a set of visibility conditions for each edge. Visibility is
determined by checking visibility at a representative number of viewpoints obtained by tessellating the viewing
sphere. Object recognition is performed by a process of iteratively matching object and image edges until either a
satisfactory match is found, or the algorithm fails. The sequence of matchings is compiled during the off-line analysis
phase. Goad’s system was not completely automatic, however: Goad selected edges as the features to be used for
recognition, and the order of edge matching was specified by hand.

The 3DPO system of Bolles and Horaud [4] was built with the intended goal of using off-line analysis to produce the
fastest, most efficient on-line object recognition program possible. 3DPO utilized the local-feature-focus method, in
which a prominent focus feature is initially identified, and then secondary features predicted from the focus feature
are used to fine-tune the localization result. The system was not fully automatic, as the focus features and secondary
features were chosen by hand.

Ikeuchi and Kanade [14], [16] first pointed out the importance of modeling sensors as well as objects in order to
predict appearances, and noted that the features that are useful for recognition depend on the sensor being used. Their
system, which will be elaborated on in section 3, predicts object appearances at a representative set of viewpoints
obtained by tessellating the viewing sphere. The appearances are grouped into equivalence classes with respect to the
visible features; the equivalence classes are called aspects. A recognition strategy is generated from the aspects and
their predicted feature values, and is represented as an interpretation tree. Each interpretation tree specifies the
sequence of operations required to precisely localize an object. The sequence of operations is broken up into two
parts: the first part classifies an input image into an instance of one of the aspects, while the second part determines
the precise pose (position and orientation) of the object within the specified aspect.

Hansen and Henderson [10] demonstrated a system that analyzed 3D geometric properties of objects and generated a
recognition strategy. The system was developed to make use of a range sensor for recognition. The system examines
object appearances at a representative set of viewpoints obtained by tessellating the viewing sphere. Geometric
features at each viewpoint are examined, and the properties of robustness, completeness, consistency, cost, and
uniqueness are evaluated in order to select a complete and consistent set of features. For each model, a strategy tree is
constructed, which describes the search strategy used to recognize and localize objects in a scene. Each strategy tree
first uses the strongest set of features to identify the object and aspect, then uses secondary features to corroborate the
aspect identification, and finally to find the exact pose.

The system of Arman and Aggarwal [1] was designed to be capable of selecting the proper sensor for a given task.
Starting with a CAD model of an object, the system builds up a tree in which the root node represents the object, and
the leaves represent features (where features are dependent upon the sensor selected), and a path from the root to a
leaf passes through nodes representing increasing specificity. For example, starting at the root, a path could lead to a
node representing the sensor property (shape, color, reflectance, ...), then to a feature class node (surface,
boundary, ...), and so on down to a leaf node that represents a particular feature of the object. Each arc in the tree is weighted
by a “reward potential” that represents the likely gain from traversing that link. At run time, the system traverses the
tree from the root to the leaves, choosing the branch with the highest weight at each level, and backtracking when
necessary.
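
A best-first traversal with backtracking over such a weighted tree might look like the following sketch. It is only an illustration of the traversal rule described above, not the authors' implementation: the dictionary-based tree layout, the `feature_found` callback, and the function name `traverse` are assumptions.

```python
def traverse(node, feature_found):
    """Depth-first, highest-weight-first search for a detectable leaf feature.

    `node` is a dict with a "name" and, for interior nodes, "children": a
    list of (weight, child) pairs, where the weight plays the role of the
    reward potential on that arc. `feature_found(name)` is a caller-supplied
    test of whether a leaf feature was observed. Returns the path to the
    first leaf that succeeds, or None when every branch is exhausted.
    """
    children = node.get("children")
    if not children:
        return [node["name"]] if feature_found(node["name"]) else None
    # Try the highest-weight arc first; fall back to the next on failure.
    for _, child in sorted(children, key=lambda wc: wc[0], reverse=True):
        path = traverse(child, feature_found)
        if path is not None:
            return [node["name"]] + path
    return None  # nothing below this node succeeded: backtrack
```

For example, if the highest-weight leaf under a "shape" node is not observed, the search falls back to the next leaf under "shape" before abandoning that branch for a lower-weight sensor property.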

The PREMIO system of Camps, et al [5] predicts object appearances under various conditions of lighting, viewpoint,
sensors, and image processing operators. Unlike other systems, PREMIO also evaluates the utility of each feature by
analyzing the detectability, reliability, and accuracy. The predictions are then used by a probabilistic matching
algorithm that performs the on-line process of identification and localization.

The BONSAI system of Flynn and Jain [7] identifies and localizes 3D objects in range images by comparing
relational graphs extracted from CAD models to relational graphs constructed from range image segmentation. The
system constructs the relational graphs off-line using two techniques: first, view-independent features are calculated
directly from a CAD model; second, synthetic images are constructed for a representative set of viewpoints obtained
by tessellating the viewing sphere, and the predicted areas of patches are determined and stored as an attribute of the
appropriate relational graph node. During the on-line recognition phase, an interpretation tree is constructed which
represents all possible matchings of the graph constructed from a range image, and the stored model graph.
Recognition is performed by heuristic search of the interpretation tree.

Sato, et al [20] demonstrated a system for recognition of specular objects. This system will be discussed more
completely in section 4. During an off-line phase, the system generates synthetic images from a representative set of
viewpoints. Specularities are extracted from each image, the images are grouped into aspects according to shared
specularities, and each specularity is evaluated in terms of its detectability and reliability. A deformable template is
also prepared for each aspect. At execution time, an input image is classified into a few possible aspects using a
continuous classification procedure based on Dempster-Shafer theory. Final verification and localization is performed
using deformable template matching.


2.2  The  Common  Threads 


After reviewing the different appearance-based systems that have been constructed, it is useful to go back and point
out the common processing steps.

2.2.1  Two  Phases  of  Processing 

Each of the systems discussed employs two distinct phases of processing. The first phase, variously called off-line,
compilation, or analysis, consists of analyzing object appearances and constructing recognition strategies. The
second phase, called on-line, run-time, or execution, consists of applying the strategies generated in the first phase to an
actual recognition task. In general, computational efficiency is not a concern in the first phase, since it does not
directly affect the actual time or effort required to perform object recognition, and the cost is only incurred once. In
contrast, the second phase is expected to execute many times as part of an application, and consequently must be
efficient. In effect, the time spent during strategy generation can be amortized over the number of executions of the
resultant strategy.

2.2.2 Explicit Object and Sensor Models

Any object recognition system must match the appearance of an object with respect to some sensor, to a model of the
object. Consequently, to automatically generate a program for object recognition, it is necessary to predict and
analyze object appearances. Objects appear differently to different sensors, so in order to predict object appearances,
sensors must be modeled as well. An appearance-based system therefore incorporates both object and sensor models.

The early appearance-based systems only made use of explicit object models, and utilized implicit sensor models,
although the need for different types of models to represent different types of features was acknowledged.
All recent systems have emphasized the fact that appearance depends upon the sensor and include explicit models of
both objects and sensors. Models may be exchanged so that the same VAC can generate object recognition programs
for a variety of objects and sensors.

2.2.3 Appearance Prediction and Analysis

In general, there are two approaches to predicting object appearances: analytic and synthetic. The analytic approach
uses the information stored in object and sensor models to analytically predict the appearance of an object from
various viewpoints. Alternatively, it is possible to generate images of objects under specific sensor conditions and analyze
the synthetic images. Both techniques have advantages and disadvantages that are discussed more completely in the
next two sections.

The appearance of an object varies with respect to the sensor used. The appearance of an object with respect to a
sensor is characterized by means of the features that can be extracted from the sensor image. Hence, each sensor model
includes a feature set, and a collection of image processing operators that are used to extract the features.

The appearance of an object also varies with respect to the relative geometry between sensor and object, which can be
referred to as the viewpoint. Potentially, there are six degrees of freedom in viewpoint, each of which spans an infinite
number of parameter values. Clearly, exhaustive computation of all possible appearances is impossible. To make the
set of possible appearances manageable, similar appearances are grouped into sets called aspects. Formally, an aspect
is a class of topologically equivalent views of an object [17]. However, since different sensors detect different
features, the formal definition of an aspect is usually relaxed to be a class of appearances that are equivalent with respect
to a feature set.

A substantial amount of work has been performed on deriving methods for analytically determining the set of
aspects of an object. For example, Plantinga and Dyer [19] compute the exact set of aspects for polyhedra under
either orthographic projection or perspective projection, using the definition of aspects as topologically equivalent
views. Kriegman and Ponce [18] compute the exact set of aspects for solids of revolution under orthographic
projection, again using the definition of aspects as topologically equivalent views. Chen and Freeman [6] determine the
exact aspects for quadric-surfaced solids under perspective projection, where aspects are views with isomorphic
line-junction graphs.

An alternative to the exact analytic computation of aspects is the exhaustive approach, in which viewpoints are
sampled uniformly throughout the space of possible viewpoints, and then equivalent viewpoints are grouped together. An
approach of this sort was used by Ikeuchi and Kanade [16], Hansen and Henderson [10], Flynn and Jain [7], and Sato,
et al [20]. In each of these systems, the space of possible viewpoints is uniformly tessellated, and the
appearance is predicted from each viewpoint corresponding to the center of a tessel. The fidelity of the sampling can be
increased by subdividing each tessel. This approach is more general than the analytic approach, since the same
procedure can be used independently of the sensor or feature set.
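
The exhaustive approach can be sketched for a very simple case. The example below is an illustration under stated assumptions, not a reproduction of any cited system: the object is a cube, the "sensor" detects only which faces are visible under orthographic projection, and a latitude-longitude grid stands in for the uniform tessellations used in practice. All names (`CUBE_NORMALS`, `sample_viewing_sphere`, `visible_faces`, `group_into_aspects`) are hypothetical.

```python
import math
from collections import defaultdict

# Outward face normals of an axis-aligned cube (a hypothetical test object).
CUBE_NORMALS = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}

def sample_viewing_sphere(n_lat=18, n_lon=36):
    """Sample view directions at the centers of a latitude-longitude grid.

    This grid is not uniform in area; it is a simple stand-in for a proper
    tessellation of the viewing sphere.
    """
    views = []
    for i in range(n_lat):
        theta = math.pi * (i + 0.5) / n_lat          # polar angle
        for j in range(n_lon):
            phi = 2.0 * math.pi * (j + 0.5) / n_lon  # azimuth
            views.append((math.sin(theta) * math.cos(phi),
                          math.sin(theta) * math.sin(phi),
                          math.cos(theta)))
    return views

def visible_faces(view, normals=CUBE_NORMALS, eps=1e-9):
    """Faces of a convex object seen from direction `view`: a face is
    visible when its outward normal points toward the viewer."""
    return frozenset(name for name, n in normals.items()
                     if sum(a * b for a, b in zip(n, view)) > eps)

def group_into_aspects(views):
    """Group sampled viewpoints into classes with the same visible faces."""
    aspects = defaultdict(list)
    for v in views:
        aspects[visible_faces(v)].append(v)
    return dict(aspects)
```

With the default sampling, every sampled direction sees exactly three faces of the cube, so the grouping yields the eight general-view aspects. Views showing only one or two faces occur only along isolated directions and are missed by this grid, which illustrates how the fidelity of the result depends on the sampling.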


2.2.4 Generation of a Recognition Strategy


The result of the off-line, compilation phase of an appearance-based system is a strategy for object recognition. The
strategy is often represented in the form of a tree that represents the sequence of operations to perform at each step of
the recognition process. Since the generation of a strategy is performed off-line, it is possible to perform relatively
expensive optimization.

There are many different computational approaches that can be employed for object recognition. Suetens, et al [22] is
a recent survey of the range of approaches. A VAC can be constructed for any given approach. Consequently, there is
no standard form for the recognition strategy output by a VAC; all that can be said is that the strategy consists of
executable code.

The strategy generated on the basis of analysis of predicted appearances is repeatedly executed in an application.
Since the strategy is executed repeatedly, optimizations performed in the off-line phase are essentially amortized over
the many on-line executions.

3 A VAC Utilizing Analytic Feature Prediction

The VAC discussed in this section was designed to generate an object localization program for a bin-picking task, and
was initially presented in [14], at which time the system was not fully automatic. Further research has led to complete
automation of program generation [15], as well as optimization of the resulting code [12]. This section will present
the complete system. The inputs to the VAC consist of an object model, specifying geometric and photometric
characteristics of the object, and a sensor model, specifying the sensor characteristics necessary for predicting object
appearances and feature variations. The output consists of a recognition strategy in the form of an interpretation tree.

A localization task is solved in two phases, as is characteristic of appearance-based systems. In the first, off-line
phase, the object and sensor models are used to predict object appearances and the variation in features.
The result of the first phase is a recognition program for the given task. The second, on-line phase consists of
applying the generated program to the actual task.

A 3D object can yield an infinite number of 2D appearances for a given sensor, resulting from changes in viewpoint.
But for any given sensor, there are a finite number of qualitatively different characteristic appearances that can be
termed aspects. An input image can be classified into an instance of an aspect based on the ranges of the feature
values that differentiate between aspects; this procedure is called aspect classification. Since aspects are essentially
collections of viewpoints, aspect classification is equivalent to rough localization. More accurate localization can be
performed by analyzing the shape change within an aspect, called linear shape change determination. The localization
program generated by the system reflects the distinction between global and local shape changes by separating the
processing into stages of aspect classification and linear shape change determination.
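
The idea of classifying an image into an aspect from ranges of feature values can be sketched as follows. This is a simplified illustration, not the interpretation tree generated by the system described here: the flat table of ranges, the feature names used in the usage note, and the function name `classify_aspect` are assumptions.

```python
def classify_aspect(features, aspect_ranges):
    """Return the names of all aspects whose feature ranges admit `features`.

    `features` maps feature names to measured values. `aspect_ranges` maps
    each aspect name to a dict of feature name -> (low, high), the range of
    values predicted for that feature over the viewpoints in the aspect.
    """
    matches = []
    for aspect, ranges in aspect_ranges.items():
        if all(low <= features[name] <= high
               for name, (low, high) in ranges.items()):
            matches.append(aspect)
    return matches
```

For instance, with hypothetical features such as a visible face count and a total visible area, a measurement is assigned to every aspect whose predicted ranges contain both values. An empty result means no aspect explains the image, and more than one match means the chosen features do not separate the aspects and further features are needed.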

3.1  Explicit  Object  and  Sensor  Models 


Computer vision systems can use many different types of sensors. A sensor can be considered to be a transducer that
transforms object features into image features, and different sensors yield different image features. For example, a
laser range finder detects the range and orientation of object surfaces, while edge-based binocular stereo yields the
range computed by triangulation on detected edges. Table 1 summarizes various sensors in terms of detectable object
features. A sensor model must specify the features detectable by the sensor.

The list of features describes the qualitative characteristics of a sensor. The quantitative characteristics are given by the detectability and reliability of each feature. Detectability specifies the conditions under which a given feature can be detected. Reliability specifies the expected error in the feature value. Both characteristics depend on the configuration of the object feature and the sensor.

The detectability of a feature by a given sensor depends on factors such as range, relative attitude, reflectivity and so on. In many applications, such as industrial workstations, many of the factors can be fixed, and relative attitude becomes the dominant factor. To consider relative attitude, fix the sensor coordinate system, and consider the relationship of a feature coordinate system with respect to it. The feature coordinate system is defined such that the z-axis is aligned with the surface normal; the x and y axes are assumed to be defined arbitrarily. There are three degrees of freedom in orientation of feature coordinates with respect to sensor coordinates.

Consider a solid unit sphere, called the orientation sphere, or o-sphere, in which each relative orientation of the feature coordinate system corresponds to a point. The direction from the center of the sphere to the point defines the orientation of the feature z-axis. The rotation of feature coordinates around the z-axis corresponds to distance from the spherical surface to the center of the sphere; a point on the surface represents a feature coordinate system obtained by rotation around an axis perpendicular to the plane formed by the north pole, the center of the sphere, and the surface point itself. The north pole is taken to be the case in which feature and sensor coordinates are aligned.

One o-sphere can be defined for each object feature, and is referred to as the feature configuration space. Figure 1 illustrates the concept.

Table 1: Summary of Sensors


Figure 1: Feature configuration space.

(a) Relationship between sensor coordinates and feature coordinates. (b) Feature coordinates as points on the o-sphere. The bottom left drawing depicts the coordinates corresponding to points on the surface of the o-sphere, while the bottom right drawing depicts the coordinates along one axis of the o-sphere.

A sensor system consists of two components: an illuminant, and a detector. For a feature to be detected, it must be visible to both components. For a given feature, a separate configuration space can be defined for each sensor component. Within each configuration space, the configurations for which the feature is detectable can be easily defined by geometric constraints. For example, a plane is detectable by a conventional TV camera if the surface normal of the plane forms an obtuse angle with the camera line of sight. The detectability of a feature with respect to a given sensor is then the intersection of the detectable regions of the configuration spaces of each of the sensor components. An example of the detectability computation for a light-stripe range finder is shown in Figure 2.


Figure 2: Detectability of a face for a light-stripe range finder. The detectable region is the intersection of the detectability of the illuminant and the detectability of the sensor.
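
As an illustration only, the intersection of component constraints can be sketched in a few lines of Python. The function names, the direction vectors, and the 30-degree tilt of the light-stripe projector are assumptions made for this example and are not taken from the system described here. Only the obtuse-angle constraint is modeled; occlusion by other parts of the object is ignored.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def faces_component(normal, line_of_sight):
    """True if the surface normal forms an obtuse angle with the line of
    sight of one sensor component (pointing from the component into the scene)."""
    return dot(normal, line_of_sight) < 0.0

def detectable(normal, illuminant_dir, detector_dir):
    """A face is detectable only if it satisfies the constraint of both
    components, i.e. the intersection of the two detectable regions."""
    return (faces_component(normal, illuminant_dir)
            and faces_component(normal, detector_dir))

# Hypothetical set-up: the detector looks straight down the -z axis and
# the light-stripe projector is tilted 30 degrees away from it.
detector_dir = (0.0, 0.0, -1.0)
tilt = math.radians(30.0)
illuminant_dir = (math.sin(tilt), 0.0, -math.cos(tilt))

print(detectable((0.0, 0.0, 1.0), illuminant_dir, detector_dir))  # seen by both components
print(detectable((1.0, 0.0, 0.1), illuminant_dir, detector_dir))  # seen by the detector only
```

The second face is visible to the detector but turned away from the projector, so it falls outside the intersection and is reported as not detectable.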


For a given viewpoint, the appearance of an object with respect to a particular sensor can be defined by a list of the detectable object features, along with the values of parameters extracted from those features. A viewpoint corresponds to a single point in the configuration space of each feature of the object. A feature is detectable if and only if the point representing the sensor viewpoint lies in the detectable region of the feature configuration space, and if no other part of the object occludes the feature. These conditions can easily be checked using a geometric modeler. Figure 3 illustrates the process for a simple polyhedral object and a light-stripe range finder.


Figure  3:  Use  of  detectability  constraints. 

(a)  Polyhedral  object  and  light-stripe  range  finder,  (b)  Detectability  constraints  of  a  face, 
(c)  Application  of  detectability  constraints,  (d)  Detectable  faces. 


3.2 Predict and Analyze Appearances


The techniques presented above make it possible to analytically determine the detectability of any feature from any viewpoint. The combination of features, along with predicted feature values, defines the appearance of the object. Next, the capability to predict appearances must be used to determine the aspects of the object.

For many feature sets, analytic approaches to determining the aspects of an object have been derived ([6], [18], [19]). However, each approach is specialized for a specific feature set and a limited collection of surface types. An alternative approach, known as the exhaustive approach, is to examine a representative set of viewpoints around the object; this approach is independent of the feature set and surface type. As the sample set grows and the spacing between samples decreases, the results from the exhaustive approach will agree arbitrarily closely with the analytic results.

The exhaustive approach is relatively easy to implement, especially for cases in which the distance between sensor and object is assumed fixed. Then, all possible viewpoints can be represented as points on the surface of a sphere centered on the object. A tessellation of the sphere using a geodesic dome, which divides the sphere into many small spherical triangles, yields a nearly uniform sampling of viewpoints. The triangles can be subdivided repeatedly to yield any desired level of sampling resolution. At the center of each spherical triangle, the detectability of each feature can be computed as outlined above.
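
The viewpoint sampling described above can be sketched as follows. This is a minimal example that assumes an icosahedron as the base polyhedron and a four-way split of each triangle; the text does not state which geodesic dome the system uses, and the function names are illustrative.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def icosahedron():
    """Return the 20 triangles of an icosahedron inscribed in the unit sphere."""
    p = (1.0 + math.sqrt(5.0)) / 2.0
    verts = [(-1, p, 0), (1, p, 0), (-1, -p, 0), (1, -p, 0),
             (0, -1, p), (0, 1, p), (0, -1, -p), (0, 1, -p),
             (p, 0, -1), (p, 0, 1), (-p, 0, -1), (-p, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    verts = [normalize(v) for v in verts]
    return [tuple(verts[i] for i in f) for f in faces]

def midpoint(a, b):
    # Midpoint of an edge, pushed back onto the unit sphere.
    return normalize(tuple((x + y) / 2.0 for x, y in zip(a, b)))

def subdivide(tri):
    """Split one spherical triangle into four smaller ones."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

def viewpoint_directions(levels):
    """Centers of the triangles after `levels` rounds of subdivision."""
    tris = icosahedron()
    for _ in range(levels):
        tris = [small for tri in tris for small in subdivide(tri)]
    return [normalize(tuple(sum(v[i] for v in tri) / 3.0 for i in range(3)))
            for tri in tris]

print(len(viewpoint_directions(0)), len(viewpoint_directions(2)))
```

Each round of subdivision multiplies the number of sampled viewpoints by four, so the resolution can be raised until the results stop changing.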

Aspects can be selected in many different ways, depending upon the features being considered. For example, aspects can be defined as collections of viewpoints for which the same set of features are visible. Alternatively, as used here, aspects can be defined on the basis of detectable faces. Consider an object with N faces (planar or curved) $S_1, S_2, \ldots, S_N$, and define the face label $X = (X_1, X_2, \ldots, X_N)$, where $X_i = 1$ or $0$ according to whether or not face $S_i$ is detectable. Viewpoints with identical face labels are grouped together into aspects. For each aspect, a representative attitude is selected and used to calculate representative feature values. Each aspect can be characterized by the feature values of the representative attitude. Further, the ranges of feature values can be obtained by examining the range of values of the features over each of the viewpoints constituting an aspect. Figure 4 illustrates the process of view generation and aspect selection.
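
The grouping of viewpoints by face label can be sketched as below. The example assumes a cube whose faces are detectable whenever they are turned toward the sensor, a deliberately simplified detectability test; the viewpoint list and the sample feature are hypothetical.

```python
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def face_label(view_dir, face_normals):
    # X_i = 1 when face S_i is turned toward the sensor, else 0.
    # view_dir points from the object toward the sensor.
    return tuple(1 if dot(n, view_dir) > 0.0 else 0 for n in face_normals)

def group_into_aspects(view_dirs, face_normals):
    """Group viewpoints that share an identical face label."""
    aspects = defaultdict(list)
    for v in view_dirs:
        aspects[face_label(v, face_normals)].append(v)
    return dict(aspects)

def feature_range(views, feature):
    """Range of one feature over the viewpoints that make up an aspect."""
    values = [feature(v) for v in views]
    return min(values), max(values)

# Face normals of a cube, ordered +x, -x, +y, -y, +z, -z.
cube = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
views = [(1, 1, 1), (2, 1, 1), (1, 2, 3), (-1, 1, 1), (1, 0, 0)]

aspects = group_into_aspects(views, cube)
for label, members in aspects.items():
    print(label, len(members))
```

In practice the viewpoint list would come from the tessellated sphere, and the detectability test would be the sensor-specific one of Section 3.1.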


3.3 Generation of Recognition Strategy


The generation of a recognition strategy depends to some extent upon the sensor used, or at least upon the features used. In this section, results are presented for sensors which produce dense range maps.

3.3.1 Aspect Classification

Aspect classification is the process of classifying an input image into an instance of an aspect. Since an aspect represents a contiguous set of viewpoints, aspect classification is equivalent to rough localization. The parameters of object pose determined through aspect classification also provide good starting parameters for the stage of linear shape change determination that follows.

One way of performing aspect classification is to extract feature values from the input image and compare this set of values to the stored value ranges that characterize aspects. This approach may be very inefficient, however, since only a few of the features may be needed to perform classification, yet all are computed. A more cost-effective approach is to determine the computational cost of each feature, and then determine a discriminating set of features that minimizes the expected cost of classification.

A classification tree, or decision tree, is a tree in which each node represents a collection of classes and an associated test, and arcs represent the possible results of a test. Leaf nodes represent the final results of classification. Using a classification tree, a classification is performed by traversing the tree from the root to a leaf. A classification tree provides a convenient framework for optimizing the aspect classification process, since using a classification tree permits tests (and corresponding computations of feature values) to be performed sequentially.


Figure 4: Extraction of aspects.

(a) Geometric model of an object. (b) The Gaussian sphere is tessellated into sixty triangles to represent viewpoints sampled. (c) Sixty computed appearances. Faces surrounded with bold lines are detectable by photometric stereo. (d) Eight component faces to be used for shape labeling. (e) The five aspects obtained through classification by shape label. (f) Representative attitudes, one for each aspect.

Aspect classification trees can be used to optimize the classification process in the following way. In the off-line stage of processing, the entire set of possible aspect classification trees is examined systematically, and the minimum cost classification tree is identified and saved; this tree stores feature identifiers and test values at each node.


A path from the root of a classification tree to a leaf represents a complete classification operation. Computing the cost of a single path is straightforward. Each test requires a feature to be computed, and each such computation incurs a computational cost. Each node in the classification tree is assigned the cost of computing the feature needed for the corresponding test. The cost of a path is the sum of the costs of the intermediate nodes.

A classification tree contains many paths from the root to the leaf nodes, and different paths may be taken with different frequencies. Therefore, the cost of a classification tree is defined to be the expected cost of a classification; that is, the average cost taken over all possible inputs. The expected cost can be computed by weighting the cost at each node by the number of samples in the population that will pass through the node. The weighted cost of every node in the tree is summed and divided by the population size to yield the overall cost. Figure 5 illustrates the method for computing the cost of a single classification and the overall cost of a classification tree.
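
The two cost definitions can be checked against each other with a short sketch. The `Node` class and the example costs below are hypothetical; they only serve to show that weighting node costs by the number of samples gives the same value as averaging the path costs.

```python
class Node:
    def __init__(self, samples, feature_cost=0.0, children=()):
        self.samples = samples            # number of sample views reaching this node
        self.feature_cost = feature_cost  # cost of the feature tested here (0 at a leaf)
        self.children = list(children)

def path_cost(path):
    """Cost of one classification: sum of the feature costs along the path."""
    return sum(node.feature_cost for node in path)

def expected_cost(root):
    """Sum of (feature cost x samples) over all nodes, divided by the population size."""
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        total += node.feature_cost * node.samples
        stack.extend(node.children)
    return total / root.samples

# Four sample views: one is classified after the first test (cost 2),
# the other three need a second, more expensive test (cost 5).
second_test = Node(3, 5.0, [Node(1), Node(1), Node(1)])
root = Node(4, 2.0, [Node(1), second_test])

print(path_cost([root]), path_cost([root, second_test]))  # 2.0 7.0
print(expected_cost(root))                                # 5.75
```

Averaging the four path costs gives (2 + 7 + 7 + 7) / 4 = 5.75, which agrees with the node-weighted sum (2 x 4 + 5 x 3) / 4.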


The problem of finding the minimum cost classification tree can be formulated as a search problem over another kind of tree, called a strategy tree. A strategy tree for aspect classification is a tree in which each path from the root to a leaf represents a complete strategy for classification; that is, each path in a strategy tree can be expanded into a classification tree. Therefore, finding the least cost classification tree is equivalent to finding the least cost path from root to a leaf in the strategy tree.


In a strategy tree, each node contains the results of applying a test using a given feature, while arcs represent tests. Each arc is labeled with the product of the cost of the feature and the expected number of samples to which the test will be applied. An arc is present when a feature can be used to break up a set into smaller sets. At the leaf nodes, all the constituent sets should be singletons, unless the feature set is incapable of distinguishing some of the input classes.

Figure 5: Cost computations.

(a) Cost of classification. (b) Expected cost of a classification tree.

Figure 6 illustrates a strategy tree for a simple case consisting of 4 classes and 3 features. At the root, all the classes are grouped into a single set. An arc is present for every computable feature which can reduce the set size, so there are three arcs at the root. The darkened path in the tree is the minimum cost path, and expands into the classification tree shown in Figure 5.
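
The search for the least cost classification tree can be sketched as a recursive search with memoization. This is a simplification made for illustration, not the strategy tree search itself: it assumes discrete feature values, equally likely classes, and hypothetical class names, feature names, and costs. Each test is charged its feature cost times the number of classes it is applied to, in the same way as the arc labels described above.

```python
from functools import lru_cache

def best_tree(classes, features, costs):
    """classes: {name: {feature: value}}; costs: {feature: cost}.
    Returns (expected cost per classification, tree), where a tree is either
    a frozenset of class names (leaf) or (feature, tuple of subtrees)."""

    @lru_cache(maxsize=None)
    def solve(subset):
        if len(subset) <= 1:
            return 0.0, subset
        best = None
        for f in features:
            groups = {}
            for c in sorted(subset):
                groups.setdefault(classes[c][f], []).append(c)
            if len(groups) < 2:
                continue  # this feature cannot break up the set
            cost = costs[f] * len(subset)
            children = []
            for value in sorted(groups):
                sub_cost, sub_tree = solve(frozenset(groups[value]))
                cost += sub_cost
                children.append(sub_tree)
            if best is None or cost < best[0]:
                best = (cost, (f, tuple(children)))
        if best is None:
            return 0.0, subset  # congruent classes: no feature separates them
        return best

    total, tree = solve(frozenset(classes))
    return total / len(classes), tree

# Hypothetical case with 4 classes and 3 features.
classes = {
    "A": {"f1": 0, "f2": 0, "f3": 0},
    "B": {"f1": 0, "f2": 1, "f3": 1},
    "C": {"f1": 1, "f2": 0, "f3": 2},
    "D": {"f1": 1, "f2": 1, "f3": 3},
}
costs = {"f1": 1.0, "f2": 1.0, "f3": 5.0}
expected, tree = best_tree(classes, ["f1", "f2", "f3"], costs)
print(expected, tree[0])
```

Here the expensive feature f3 separates all four classes in one test at an expected cost of 5, while two cheap tests in sequence reach the same result at an expected cost of 2, so the search prefers the latter.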

There are cases for which it is not possible, given the set of features, to distinguish two classes. Indistinguishable classes are referred to as congruent classes, and the corresponding nodes as congruent nodes. Since the classes in an aspect classification tree represent aspects of some object, the existence of congruent classes means that the corresponding aspects cannot be distinguished with the available features; such aspects are referred to as congruent aspects. Congruent aspects do not represent a failure of the search procedure, but rather indicate a fundamental limitation of the feature set. In many cases, the linear shape change determination step corrects for ambiguous aspect classification and determines the correct object pose.

3.3.2 Linear Shape Change Determination

The aspect classification step results in the classification of an input image as an instance of an aspect. Since an aspect consists of a contiguous collection of views of an object, aspect classification is equivalent to rough localization; the possible collection of object poses is limited to those consistent with the observed aspect.


Figure 6: Strategy tree

The next step in the localization process determines the exact pose, given the initial estimate obtained from aspect classification. The same set of features is visible throughout an aspect, so no non-linear events such as the appearance or disappearance of a feature occur. Therefore, the step consists of determining the pose within an aspect, subject only to linear changes (rotation and translation) of features; this step in the localization process is known as linear shape change determination (LSCD).

One way to perform LSCD is to utilize a model-based approach in which image features are matched to model features, a pose is hypothesized and used to predict locations of features in the image, and the pose is refined by computing the error in predictions and observations and updating the pose appropriately. In most model-based systems, the matching stage is very difficult, since all possible matches between image and model features must be investigated. Since aspect classification has been performed, however, the correspondence between some set of image and model features has already been established. In particular, the assumptions underlying the LSCD method presented here are:

• the correspondences between model and image faces are known;

• the correspondences between model and image edges are unknown.

The aspect classification strategy presented in the previous subsection was encoded in the form of a tree, the classification tree. Each leaf node of the tree represents an aspect or collection of congruent aspects, and for each leaf node a different LSCD strategy may be appropriate. The VAC presented here computes a separate LSCD strategy for each leaf node of the aspect classification tree. The steps in each LSCD strategy are encoded as nodes that are attached to the leaf nodes of the aspect classification tree. Although the particular computational procedures vary between aspects, the same steps are followed in the same order for each aspect:

1. determine the coordinate system of the primal face (the visible face with the largest 3D area):

1.1. determine the origin of the primal face;

1.2. determine the z-axis orientation of the primal face;

1.3. determine the x-axis orientation of the primal face;

2. estimate the body coordinate system;

3. establish correspondences between image and model edges;

4. recover exact body coordinates by numerical minimization.

The LSCD strategy is determined off-line through the following steps:

1. For each aspect, the visible face with the largest 3D area is selected as the primal face.

2. Each primal face is analyzed, and a method for defining the face coordinate system is determined. Separate nodes are attached to the classification tree which define the exact procedures used in each individual step of determining the origin, z-axis, and x-axis of the primal face.

3. Given the estimated coordinate system of the primal face and the transformation between primal face coordinates and model coordinates, the body coordinate system can be estimated with respect to the sensor coordinates. This is encoded as a separate node of the tree.

4. Knowing the object aspect and a rough estimate of the body coordinate system enables the prediction of the location of model edges in the image. The predicted edges can then be matched to observed edges. This process is encoded as a separate node of the tree.

5. A fine-tuning procedure is used to determine the exact body coordinates by adjusting the estimated body coordinates so that image edges match predicted model edges as closely as possible. An exact match is not possible, so the procedure finds the body coordinates that minimize the error between predicted and observed edge locations. This procedure is encoded as a separate node of the tree.

3.3.3 The Interpretation Tree

The overall strategy for object localization is encoded in the form of a tree, the interpretation tree. The top part of the interpretation tree consists of the aspect classification tree, and consists of directions to a series of feature value computations and tests that result in the classification of an input image into an instance of an aspect. The bottom part of the interpretation tree is six nodes deep, and consists of the steps in the LSCD strategy that are appropriate to each aspect. Figure 7 and Figure 8 illustrate the interpretation trees for two objects: a toy car, and an L-shaped polyhedron, respectively.


3.4 Run-time Execution

Each of the procedures represented by a node of an interpretation tree corresponds to an executable object stored in a program library. During the off-line phase, the creation of a node of the interpretation tree is accompanied by the instantiation of an executable object from the program library, and the insertion of the object at the node. During the on-line phase, these instantiated objects are executed in order to perform object recognition. Message-passing is used to communicate between objects.

For each execution of the object recognition strategy during the on-line phase, a preprocessed image is passed to the root of the interpretation tree and the object stored at the root is invoked. The execution of the root object results in the computation of some feature value and the application of a test. The outcome of the test results in the preprocessed image being passed to the appropriate node at the next level of the tree, and the corresponding stored object is invoked. The sequence of message passing and object invocation proceeds until a leaf node is reached, indicating that processing is complete. The results of processing are then passed back up the tree and returned from the root.

Figure 7: Generation of an object recognition strategy for a toy car.

(a) Object model. (b) Object aspects. (c) Aspect classification tree. (d) Complete interpretation tree.

At congruent nodes of the classification part of the tree, no unique aspect is identified. Instead, several aspects are determined to be possible. For each congruent aspect, the LSCD part of the tree is executed, and the results returned are compared by the congruent node; the aspect yielding the minimum error is selected as the correct interpretation and this result is passed back up to the root of the tree.

Figure 8: Generation of an object recognition strategy for an L-shaped polyhedron.

(a) Object model. (b) Object aspects. (c) Aspect classification tree. (d) Complete interpretation tree.
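
The run-time behavior of Section 3.4, including the handling of congruent nodes, can be sketched with three small classes. All class names, the dictionary standing in for a preprocessed image, and the numeric values are hypothetical. The six-node LSCD chain is collapsed into a single `PoseNode` for brevity, and ordinary method calls stand in for message passing.

```python
class FeatureTestNode:
    """Computes one feature value, applies a test, and forwards the image."""
    def __init__(self, feature, threshold, below, at_or_above):
        self.feature = feature
        self.threshold = threshold
        self.below = below
        self.at_or_above = at_or_above

    def run(self, image):
        value = self.feature(image)
        child = self.below if value < self.threshold else self.at_or_above
        return child.run(image)  # the result is passed back up to the caller

class PoseNode:
    """Stands in for the LSCD part of the tree for a single aspect."""
    def __init__(self, aspect, localize):
        self.aspect = aspect
        self.localize = localize

    def run(self, image):
        pose, error = self.localize(image)
        return {"aspect": self.aspect, "pose": pose, "error": error}

class CongruentNode:
    """Runs LSCD for every congruent aspect and keeps the minimum-error result."""
    def __init__(self, candidates):
        self.candidates = candidates

    def run(self, image):
        results = [node.run(image) for node in self.candidates]
        return min(results, key=lambda r: r["error"])

def make_localizer(aspect):
    # Placeholder localizer: looks up a precomputed fitting error.
    return lambda image: ((0.0, 0.0, 0.0), image["fit_error"][aspect])

tree = FeatureTestNode(
    feature=lambda image: image["largest_region_area"],
    threshold=25.0,
    below=PoseNode("C", make_localizer("C")),
    at_or_above=CongruentNode([PoseNode("A", make_localizer("A")),
                               PoseNode("B", make_localizer("B"))]),
)

image = {"largest_region_area": 40.0,
         "fit_error": {"A": 0.8, "B": 0.2, "C": 0.5}}
print(tree.run(image)["aspect"])  # B
```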

3.5 Experiments

This section illustrates the execution of the compiled programs. The experiments were performed using two different range sensors: dual photometric stereo [13], and an ERIM laser range finder [11].

3.5.1  Dual  Photometric  Stereo:  Toy  Car 

The complete interpretation tree for the toy car is presented in Figure 7. The toy car was placed in a scene on top of a pile of other objects. The input car scene is shown in Figure 9. Preprocessing stages result in the computation of:

•  A  needle  map  containing  the  gradient  space  values  at  each  pixel 

•  An  edge  map. 

• A label map indicating the region to which each pixel belongs.


Figure 9: Input scene for dual photometric stereo experiment


The aspect classification steps performed are illustrated by the black nodes shown in the interpretation tree of Figure 10. Starting below the node at which the aspect is identified, the LSCD processing begins. The first node determines the mass center of the target region and declares that position to be the origin of the face coordinate system. The next node determines the average surface orientation of the target region and declares that to be the orientation of the z-axis of the primal face coordinate system.
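
The first two LSCD nodes can be sketched as follows. The origin and z-axis follow the description above (mass center and average surface orientation). How the x-axis is chosen is not specified in this passage, so projecting a hint direction onto the face plane is purely an assumption of this example, as are all function names.

```python
import math

def mean(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(3))

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def face_frame(points, normals, x_hint):
    """Estimate a face coordinate system from range data of the target region.
    points:  3D points of the region; normals: unit surface normals at those points.
    x_hint:  assumed direction (e.g. a dominant edge) used to fix the x-axis."""
    origin = mean(points)           # mass center of the region
    z = normalize(mean(normals))    # average surface orientation
    d = dot(x_hint, z)
    x = normalize(tuple(h - d * zc for h, zc in zip(x_hint, z)))  # remove z component
    y = cross(z, x)                 # completes a right-handed frame
    return origin, x, y, z

points = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0), (2.0, 2.0, 1.0), (0.0, 2.0, 1.0)]
normals = [(0.0, 0.0, 1.0)] * 4
origin, x, y, z = face_frame(points, normals, x_hint=(1.0, 0.0, 0.5))
print(origin, x, y, z)
```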

The next two nodes complete the determination of the face coordinate system and estimate the body coordinate system. Figure 11 illustrates the estimated body coordinates overlaid on the input image.

The next step in the process is the determination of the edge pairs and fine-tuning of the object pose. Figure 12 illustrates the results of these processes.

3.5.2 ERIM Laser Range Finder: L-shaped Polyhedron

The complete interpretation tree for the L-shaped polyhedron is presented in Figure 8. The L-shaped polyhedron was placed in a scene and an ERIM image obtained and presented to the root of the interpretation tree. The results are presented in Figure 13. The results are interesting in that a congruent node was encountered. The node contained two aspects, and the figure shows the two possible interpretations overlaid on the input image. Based on a comparison of the resulting errors, the correct aspect and pose were determined.


Figure 12: Final determination of object pose.

(a) Correspondences between model and image edges. (b) Final pose overlaid on input scene.

4  A  VAC  Utilizing  Image  Synthesis  for  Prediction 

The appearance-based system of Sato et al. [20] was built for the purpose of recognizing specular objects. Specular objects pose a special problem for computer vision. Specularities are often the most prominent image features, and yet contain no brightness variations which can be used for edge detection or 3D shape analysis. Moreover, specular features may appear, disappear, or change shape abruptly with small variations in viewpoint. The presence of a specularity requires a precise configuration of illuminant, surface and sensor, and therefore provides a powerful constraint on the underlying surface geometry. However, the constraint is purely local, and does little to constrain the object pose.

Specular reflections are found in nearly every imaging scenario. Metal, glass, plastic, and many other materials are highly specular. In addition to optical images, there are other imaging systems that are based on specular reflection. For example, radar is based on specular reflection, as are underwater sonar and sonography.

Therefore, it is important to establish techniques to recognize objects using specular images.

4.1  Explicit  Object  and  Sensor  Models 

An analytic approach to appearance-based vision is impractical in the domain of specular objects. Because of interreflections between shiny surfaces, analytic prediction of specularities is extremely difficult. An alternative is the use of a sensor simulator, which generates object appearances based on both object and sensor models.

Sensor simulators are very similar to ray tracers of computer graphics. A 3D scene is described by a geometric modeler. A sensor simulator traces the path of light rays from pixels into the scene. Every time a ray hits an object surface, the ratios of reflected and transmitted energy are computed on the basis of the surface reflectance function and coefficient of refraction, respectively. The sensor simulator then traces both the reflected and refracted rays. When a ray reaches a light source, the energy emitted by the source is specified by the sensor model. The incident energies of all the rays towards a pixel are summed to determine the brightness value at the pixel and therefore predict the object appearance at the pixel.


Figure 13: ERIM Laser Range Finder Experiment

(a) Input scene. (b) Results of aspect classification. (c) Two possible poses. (d) Edge correspondences. (e) Resultant pose.

One critical difference between a sensor simulator and a ray tracer is that sensor simulators maintain the symbolic correspondences between regions of the image and the object surfaces underlying the regions. In the case of specular objects, this correspondence permits the analysis of which object surfaces produce strong, stable specularities, and from which directions specularities are visible on each surface.


Specular features are characterized by strong, distinct, and saturated brightness. In many cases, specularities can be extracted by the simple procedure of binary thresholding. Some specularities are easier to detect than others, however. In general, the size of a feature determines the ease with which it can be detected. For example, elongated specularities on a cylindrical surface are easy to detect, while a specular spot on a small sphere is difficult to detect. Thus, the detectability of a specular feature is related to the 3D shape of the surface underlying the feature.
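
Extraction by binary thresholding, with region size as a rough proxy for detectability, can be sketched as below. The image is a plain list of rows, and the threshold value and 4-connectivity are assumptions of this example rather than details given in the text.

```python
def specular_regions(image, threshold):
    """Binary-threshold the image and return the pixel counts of the
    4-connected bright regions, largest first."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or seen[r][c]:
                continue
            # Flood-fill one bright region starting from (r, c).
            stack = [(r, c)]
            seen[r][c] = True
            size = 0
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            sizes.append(size)
    return sorted(sizes, reverse=True)

# An elongated highlight (as on a cylinder) and a single-pixel spot
# (as on a small sphere).
image = [
    [0,   0,   0,   0,   0, 0],
    [0, 255, 255, 255, 255, 0],
    [0,   0,   0,   0,   0, 0],
    [0,   0,   0,   0, 250, 0],
]
print(specular_regions(image, threshold=240))  # [4, 1]
```

A minimum region size could then be applied to discard specularities too small to be detected reliably.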

While a specular feature might be easily detected, it could be quite unstable; that is, the specularity might be visible only over a small range of viewpoints, and a slight change in viewpoint could cause it to disappear. Such specularities are termed unstable, and are poor choices for use in recognition. The stability of a specular feature is related to the size of the collection of viewpoints from which the feature can be detected.

To make the concept of stability more clear, consider a co-located camera and light source. Specular reflections are detectable when the object surfaces are nearly perpendicular to the camera line-of-sight. Consider the motion of the camera/light system around the surface of a sphere centered on an object, with the line of sight always toward the center of the sphere. If a small specular sphere is being imaged, a specular spot will be observed, and will continue to be observed for all viewpoints; hence the specularity arising from a spherical surface is extremely stable. Now consider a cube being imaged. Each planar surface only yields a specularity when the line of sight is perpendicular to the surface, and the specularity disappears for small changes in viewpoint; hence, specularities from planar surfaces are unstable. The area on the viewing sphere corresponding to detectable viewpoints is a measure of stability of a specular feature.
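
The stability measure can be approximated numerically as the fraction of sampled viewpoints from which a specularity is detectable. The sketch below assumes a co-located camera and light source, a Fibonacci lattice in place of the geodesic tessellation, and a 5-degree angular tolerance for a planar facet; all three are choices made for this example only.

```python
import math

def fibonacci_sphere(n):
    """Roughly uniform sample of n directions on the unit sphere."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return points

def stability(is_detectable, viewpoints):
    """Fraction of the viewing sphere from which the specularity is detectable."""
    return sum(1 for v in viewpoints if is_detectable(v)) / len(viewpoints)

def planar_facet(normal, tolerance_deg):
    """Specularity visible only when the line of sight is within
    tolerance_deg of the facet normal (v points from object to camera)."""
    limit = math.cos(math.radians(tolerance_deg))
    return lambda v: sum(a * b for a, b in zip(normal, v)) >= limit

def specular_sphere(v):
    # A sphere always has some surface point facing the camera/light pair.
    return True

views = fibonacci_sphere(10000)
plane_stability = stability(planar_facet((0.0, 0.0, 1.0), 5.0), views)
sphere_stability = stability(specular_sphere, views)
print(plane_stability, sphere_stability)
```

The planar facet is detectable from about 0.2 percent of the viewing sphere, consistent with the area of a 5-degree spherical cap, (1 - cos 5°) / 2, while the sphere is detectable from every viewpoint.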

Figure 14 illustrates the detectability and stability for specular features over four different surface types: planar, cylindrical, conical, and elliptical. As can be seen in the figure, planar surfaces have easily detectable specularities that are low in stability, while spherical surfaces have low detectability but high stability. Cylindrical and conical surfaces fall somewhere in between.

4.2 Predict and Analyze Appearances


The system employs an exhaustive method for appearance analysis. Using a sensor simulator, the system generates synthetic images for a representative collection of viewpoints obtained by tessellating the unit sphere. The viewpoints are then grouped into aspects on the basis of similar feature sets.

Assuming that range to the object is constant, all possible viewing directions can be represented as points on the unit
sphere. A geodesic partition of the viewing sphere uniformly tessellates the sphere into small triangles. The center of
each triangle is chosen to represent a viewing direction. Triangles can be further subdivided into smaller triangles to
make the sampling as fine as desired. The sensor simulator is then used to generate synthetic images at each
representative viewpoint. Figure 15 shows some of the appearances generated for a simple object.
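The tessellation step can be sketched as follows. This is a minimal illustration, assuming the geodesic partition starts from an icosahedron and splits each triangle into four; the paper does not state which base polyhedron or subdivision rule was used, and the function names (`tessellate`, `viewing_directions`) are hypothetical. The resulting cells are only approximately equal in area.

```python
import math

def _normalize(v):
    """Scale a 3-vector to unit length so that it lies on the viewing sphere."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _icosahedron():
    """Return the 20 triangular faces of a unit icosahedron (assumed base shape)."""
    t = (1.0 + math.sqrt(5.0)) / 2.0
    verts = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
             (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
             (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return [tuple(_normalize(verts[i]) for i in face) for face in faces]

def _midpoint(a, b):
    """Midpoint of an edge, pushed back out onto the unit sphere."""
    return _normalize(tuple((x + y) / 2.0 for x, y in zip(a, b)))

def tessellate(levels):
    """Subdivide every triangle into four, `levels` times."""
    triangles = _icosahedron()
    for _ in range(levels):
        finer = []
        for a, b, c in triangles:
            ab, bc, ca = _midpoint(a, b), _midpoint(b, c), _midpoint(c, a)
            finer += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        triangles = finer
    return triangles

def viewing_directions(levels):
    """One representative viewing direction per cell: the normalized triangle center."""
    directions = []
    for a, b, c in tessellate(levels):
        center = tuple((x + y + z) / 3.0 for x, y, z in zip(a, b, c))
        directions.append(_normalize(center))
    return directions
```

Each subdivision level multiplies the number of cells by four, so `viewing_directions(2)` yields 320 sample viewpoints; each direction would then be passed to the sensor simulator.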


Figure 15: Specular images of a simple object.
(a) Object. (b) Synthesized sample images.


Each image is processed to extract specularities, and the data structure created by the sensor simulator is used to
determine the primitive component underlying each specularity. For N primitive components $P_1, P_2, \ldots, P_N$, a cell
label can be defined as an N-tuple $(X_1, X_2, \ldots, X_N)$ such that $X_i = 1$ or $0$ according to whether or not
component $P_i$ gives rise to a detectable specularity at the viewpoint represented by the cell. Cells with identical
cell labels are grouped together to form aspects. Thus, in the case of specular objects, aspects are defined with respect
to detectable specularities. Geometrically, the process can be considered as follows. For each primitive component, a
geodesic dome is constructed in which each cell is labeled according to whether the component gives rise to a specularity
from that viewpoint. The geodesic domes for each component are then intersected, and each distinct region corresponds to
an aspect. Figure 16 illustrates the selection of aspects.
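The grouping of cells into aspects can be sketched in a few lines. This is an illustrative simplification: it groups cells purely by identical labels and does not check that the cells of an aspect form one connected region on the sphere. The names `cell_label` and `group_into_aspects`, and the toy detection data, are assumptions made for the example.

```python
from collections import defaultdict

def cell_label(detected_components, num_components):
    """Build the N-tuple (X_1, ..., X_N): X_i is 1 if component i yields a
    detectable specularity at this cell's viewpoint, else 0."""
    return tuple(1 if i in detected_components else 0 for i in range(num_components))

def group_into_aspects(labels_by_cell):
    """Group cells that share an identical label; each group is one aspect."""
    aspects = defaultdict(list)
    for cell, label in labels_by_cell.items():
        aspects[label].append(cell)
    return dict(aspects)

# Toy data (hypothetical): for each cell index, the set of components
# that produced a detectable specularity in the synthetic image.
detections = {0: {0}, 1: {0, 2}, 2: {0, 2}, 3: set(), 4: {0}}
labels = {cell: cell_label(found, 3) for cell, found in detections.items()}
aspects = group_into_aspects(labels)
```

Here cells 0 and 4 share the label (1, 0, 0), cells 1 and 2 share (1, 0, 1), and cell 3 has the empty label (0, 0, 0), giving three aspects.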


Figure 16: Aspect selection for the simple object


4.3 Generate Recognition Strategy


As discussed above, specular features can vary in their characteristics of detectability and stability. For the purposes
of object recognition, the system sorts specular features on the basis of detectability and stability in order to select the
most effective features for aspect classification.


Detectability was defined above as the measure of the ease with which a specularity can be detected, and was related
to the area of the specularity. The system uses as a measure of detectability the number of pixels of the largest
appearance of a specularity, normalized by dividing by the area of the largest detected specularity.

Stability was defined as the area of the viewing sphere over which a given specularity is detected. In the case of a
tessellated viewing sphere, this measure can be approximated by counting the number of cells within which the
specularity is detected, normalized by the total number of cells.

An evaluation function is required to combine the measures of detectability and stability into a measure of overall
feature utility. For each aspect, the features are ordered by decreasing utility. At run-time, features are applied in
order of decreasing utility.
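The two measures and the ordering step can be sketched as below. The detectability and stability computations follow the definitions in the text. The evaluation function is not specified in the text, so the product used here is purely a placeholder assumption, as are the function names and the toy pixel areas. For brevity the sketch ranks features over all cells rather than per aspect.

```python
def detectability(areas_by_feature):
    """Largest pixel area of each feature, normalized by the largest
    specularity area detected for any feature."""
    largest = max(max(areas) for areas in areas_by_feature.values())
    return {f: max(areas) / largest for f, areas in areas_by_feature.items()}

def stability(areas_by_feature):
    """Fraction of viewing-sphere cells in which each feature is detected."""
    return {f: sum(1 for a in areas if a > 0) / len(areas)
            for f, areas in areas_by_feature.items()}

def rank_features(areas_by_feature, utility=lambda d, s: d * s):
    """Order features by decreasing utility. The default product is an
    assumed evaluation function, not the one used in the paper."""
    det = detectability(areas_by_feature)
    stab = stability(areas_by_feature)
    scores = {f: utility(det[f], stab[f]) for f in areas_by_feature}
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy data (hypothetical): specularity area in pixels per cell, 0 = not detected.
areas = {
    "wing": [0, 40, 60, 0],
    "fuselage": [10, 10, 10, 10],
    "tail": [0, 0, 100, 0],
}
order, scores = rank_features(areas)
```

With this placeholder utility the wing ranks first (bright in half the cells), the tail second (brightest but visible in one cell only), and the fuselage last (always visible but faint). A different evaluation function can be supplied through the `utility` argument.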

Aspect classification is equivalent to rough localization. Finer localization is difficult in the case of specular
objects because specular features change their shape drastically with small changes in viewpoint. Moreover, the
exhaustive approach used in aspect determination may miss an accidental specularity that is only visible between two
sampled viewpoints. Consequently, deformable template matching was selected as the procedure for fine localization.

Deformable template matching permits the template to deform according to certain constraints. An appearance is
described as a combination of templates, each of which describes a single specularity. The templates are interconnected
conceptually by springs. The quality of match is measured by the sum of the internal deformation energy of the springs,
and the external energy needed to fit each template to a real specularity. Thus, a deformable template can be used to
find a match, even when a specularity changes shape or position. Moreover, matches can still be made even in the
presence of accidental appearances or missing features.

A deformable template is prepared for each aspect using the appearance which is located at the center of the aspect.
Specularities appear as spots or line segments, so each template consists of points and line segments. Specular features
are extracted from the central appearance. For an elongated feature, a line is fit to the feature and used to represent it
in the template. For a spot feature, a point located at the center of the feature is used to represent the feature in the
template. A conceptual spring is located at each endpoint of a line feature, and at the point representing a spot feature.
The spring energy is calculated from the displacement between the original and current position of the spring. Thus,
the energy of a spot feature is a function of the displacement between the current and original position. For a line
feature, the energy is a function of the displacement energy of the two endpoints.
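The spring energies of the two feature types can be illustrated as follows. The text says only that the energy is a function of displacement; the quadratic (Hooke's law) form and the stiffness parameter `k` used here are assumptions, and all function names are hypothetical.

```python
def spring_energy(original, current, k=1.0):
    """Energy of one conceptual spring, assumed quadratic in the displacement
    between the original and current position of the spring."""
    dx = current[0] - original[0]
    dy = current[1] - original[1]
    return 0.5 * k * (dx * dx + dy * dy)

def spot_energy(original_point, current_point, k=1.0):
    """A spot feature carries a single spring at its center point."""
    return spring_energy(original_point, current_point, k)

def line_energy(original_endpoints, current_endpoints, k=1.0):
    """A line feature carries one spring at each endpoint; the energies add."""
    return sum(spring_energy(o, c, k)
               for o, c in zip(original_endpoints, current_endpoints))

# A spot displaced by (3, 4) and a line whose endpoints move by (0, 1) and (0, 2).
e_spot = spot_energy((0.0, 0.0), (3.0, 4.0))
e_line = line_energy(((0.0, 0.0), (1.0, 0.0)), ((0.0, 1.0), (1.0, 2.0)))
```

Under the quadratic assumption, the spot displaced by a distance of 5 has energy 12.5, and the line has energy 0.5 + 2.0 = 2.5.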


4.4  Run-time  Execution 


Run-time execution is broken into two distinct stages: aspect classification and verification. In contrast to the system
discussed in section 3, aspect classification does not uniquely classify an input image as an instance of an aspect.
Rather, aspect classification is used to eliminate impossible aspects. Remaining aspects are input into the next stage,
in which deformable template matching is used for verification.

4.4.1  Aspect  Classification 

Specular features can be very unstable. Small changes in viewpoint may cause a given specularity to appear, disappear,
or change shape. Consequently, it is difficult to identify a single specularity, or a set of specularities that define
an aspect, with complete confidence. Therefore, rather than employ a binary classification of an input image as an
instance of an aspect, the run-time aspect classification system employs a continuous classification method based on
the Dempster-Shafer methodology [21]. Figure 17 illustrates the classification method. For simplicity, the illustration
is limited to aspects lying on a single great circle of the viewing sphere. Each match with a template from a single
feature generates a likelihood distribution, in which a value close to 1 means that the corresponding aspect is very likely.
Likelihood distributions from separate features are merged using Dempster-Shafer theory. As shown in the figure,
each additional feature reduces the number of likely aspects and sharpens the peaks of the remaining ones. The
likelihood values for impossible aspects decrease with each additional feature, while the likelihood values of possible
aspects increase. After the evidence from every available feature has been applied, the overall likelihood distribution
may still contain several peaks, each of which represents a possible aspect classification for the input image.
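The merging of evidence from separate features can be sketched with Dempster's rule of combination. How the system converts a template match into a mass assignment is not described in the text, so the mass values, the aspect names, and the use of plausibility as the per-aspect likelihood are all assumptions made for illustration.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: multiply masses of every pair of focal sets, assign the
    product to their intersection, and renormalize by the non-conflicting mass."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("the two bodies of evidence are in total conflict")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

def plausibility(mass, aspect):
    """Total mass of all focal sets that do not rule out the given aspect."""
    return sum(w for s, w in mass.items() if aspect in s)

# Hypothetical frame of four aspects and two feature matches.
frame = frozenset("ABCD")
m_feature1 = {frozenset("AB"): 0.8, frame: 0.2}   # first feature suggests A or B
m_feature2 = {frozenset("BC"): 0.6, frame: 0.4}   # second feature suggests B or C
merged = combine(m_feature1, m_feature2)
likelihood = {a: plausibility(merged, a) for a in sorted(frame)}
```

After both features are applied, aspect B keeps a plausibility of 1.0 while A, C, and D fall to 0.40, 0.20, and 0.08, which mirrors the narrowing of the distribution described above.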


Figure 17: Aspect classification based on evidential reasoning


4.4.2  Verification 


The verification process determines the correct aspect by matching the input image to the complete template for each
of the possible aspects. Each template can move over the entire image to minimize the total energy. The total energy
is comprised of a weighted sum of constraint energy and potential energy.
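Written out with assumed notation (the symbols $E_c$ and $E_p$ and the weights $w_c$ and $w_p$ are introduced here for illustration and are not taken from the original), a weighted sum of the two energy terms has the form:

```latex
E_{\text{total}} = w_c \, E_c + w_p \, E_p
```

where $E_c$ denotes the constraint energy and $E_p$ the potential energy. Under this reading, a larger $w_c$ makes the template stiffer, while a larger $w_p$ pulls the template more strongly toward the detected specularities.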


Potential energy represents the energy of the position of the template, while constraint energy represents the energy of
the relations between template components. Potential energy is readily visualized as the height of the template in a
potential field defined by the detected specularities in the image. Constraint energy is modeled by springs connecting
the template points to image feature points; as the template deforms, the springs stretch and the constraint energy
increases. Figure 18 illustrates template matching.


An optimization procedure is used to find the energy minimum. To avoid getting trapped in local minima, some noise
is added to the total energy. The global minimum energy for each template specifies the quality of fit of the input
image to the template. The best match is chosen by comparing the minimum energies for each of the candidate
aspects.
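A minimal sketch of the verification stage is given below. The text does not name the optimization procedure, so everything here is an assumption: a random-perturbation search with decaying noise stands in for the optimizer, the potential energy is the squared distance from each template point to the nearest detected specularity, and the constraint energy uses springs between consecutive template points. All names and parameter values are hypothetical.

```python
import random

def potential_energy(points, detections):
    """Sum of squared distances from each template point to its nearest detection."""
    return sum(min((px - dx) ** 2 + (py - dy) ** 2 for dx, dy in detections)
               for px, py in points)

def constraint_energy(points, rest):
    """Assumed spring model: penalize changes in the offsets between
    consecutive template points relative to the undeformed template."""
    total = 0.0
    for i in range(len(points) - 1):
        ex = (points[i + 1][0] - points[i][0]) - (rest[i + 1][0] - rest[i][0])
        ey = (points[i + 1][1] - points[i][1]) - (rest[i + 1][1] - rest[i][1])
        total += ex * ex + ey * ey
    return total

def total_energy(points, rest, detections, w_c=1.0, w_p=1.0):
    """Weighted sum of constraint and potential energy (weights are assumed)."""
    return w_c * constraint_energy(points, rest) + w_p * potential_energy(points, detections)

def fit_template(rest, detections, steps=4000, noise0=1.0, step=0.5, seed=0):
    """Perturb one template point at a time; noise added to the candidate energy
    lets the search occasionally accept uphill moves and leave local minima."""
    rng = random.Random(seed)
    current = [tuple(p) for p in rest]
    e_cur = total_energy(current, rest, detections)
    best, e_best = list(current), e_cur
    for k in range(steps):
        noise = noise0 * (1.0 - k / steps)          # noise decays over the run
        i = rng.randrange(len(current))
        cand = list(current)
        cand[i] = (current[i][0] + rng.gauss(0.0, step),
                   current[i][1] + rng.gauss(0.0, step))
        e_cand = total_energy(cand, rest, detections)
        if e_cand + rng.gauss(0.0, noise) < e_cur:
            current, e_cur = cand, e_cand
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best

def best_aspect(templates, detections):
    """Fit every candidate template and choose the one with the lowest energy."""
    fits = {name: fit_template(rest, detections)[1] for name, rest in templates.items()}
    return min(fits, key=fits.get), fits

# Hypothetical data: three detected specularities along a horizontal line.
detections = [(3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
templates = {
    "aspect_a": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],    # same shape, shifted
    "aspect_b": [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)],   # wrong shape
}
winner, energies = best_aspect(templates, detections)
```

The horizontal template can slide onto the detections with little deformation, whereas the vertical template must stretch its springs to get close, so `aspect_a` ends with the lower minimum energy and is selected.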


4.5  Experiments 


In this section, the VAC for specular object recognition is applied to two different kinds of specular images: real
optical specular images, and synthesized synthetic aperture radar (SAR) images.

4.5.1  Real  Optical  Images 

For this experiment, a real toy airplane was constructed. The sensor used was a TV camera with a co-located light
source. The sensor parameters were obtained by calibration. Figure 19 shows a real specular image of the toy
airplane.

Figure 18: Deformable template matching.
(a) Template. (b) Potential energy. (c) Constraint energy.

Figure 19: Real specular image of toy airplane.

The object was modeled using the Vantage geometric modeler [2]. Object aspects were determined using the exhaustive
method. Since man-made objects such as the airplane have only a few stable poses, an aspect map over the entire
viewing sphere was not generated. Instead, only the appearances of the airplane from the equator of the viewing
sphere were considered. Sample images were generated on the equator at 5° increments for a total of 72 samples.
Appearances were generated using a sensor simulator. Figure 20 illustrates the object model and some sample
appearances.

The concentric arcs in Figure 21 correspond to the visibility maps for each primitive surface. The outermost arc
corresponds to the detectable directions of the rear fuselage. The arc is unbroken: the part can be observed from all
viewing directions. The top-left image in the figure shows the set of possible appearances of specular features arising
from the rear fuselage as a function of viewing direction. The other images correspond to other arcs of the visibility
map. Some of the arcs in the map are broken; the missing arc regions correspond to viewpoints from which the primitive
surface is not visible. The figure shows the features in their computed order of significance.


Figure 20: Model airplane and predicted specular appearances.
(a) Airplane model. (b) Sample appearances.

Figure 21: Visibility and significance of features in optical experiment.
(a) Visibility map. (b) Specular features in order of significance.

To test the resulting recognition program, a real specular image of the toy airplane was obtained and input to the
system. The first step in the recognition process is aspect classification, in which possible aspects are searched by
matching with partial templates. Figure 22 shows the input image and the results of matching to the first three partial
templates. The figure clearly shows the narrowing of the likelihood distribution as additional matching is performed.

The result of the aspect classification step was the selection of aspects at 45°, 115°, 125°, 165°, and 170°.

Following aspect classification, verification was performed using deformable template matching. Figure 23 illustrates
the verification step. One template was used for each of the aspects selected. In each case the template changed its
shape to match the specularities in the input image and converged to the shapes shown by the white lines superimposed
on the copies of the real image in the figure. The minimum energy, and hence the best match, was obtained for
the aspect at 170°.
Figure 22: Aspect classification stage for optical experiment.
(a) Input optical image. (b) Accumulation of evidence. (c) Airplane parts.

Figure 23: Verification stage for optical image.


4.5.2 Synthetic Aperture Radar Image

Synthetic aperture radar (SAR) is a flyable radar system that is often used on aircraft or satellites. SAR can produce
very high-resolution two-dimensional images, and can detect details of targets, especially artificial structures. SAR
images are based on specular reflections of radar waves, so image features in SAR are similar to specular features in
optical images. Specifically, the features are very sensitive to change in object orientation, change shape abruptly, and
appear or disappear suddenly.

This experiment was performed using synthetic data only. A SAR simulator (SARSIM), developed by TASC, was
used to generate synthetic SAR images. The simulator created not only brightness images, but also attribute images
which specified the part of the object which caused each radar feature.
An airplane model was used for the object model that was similar to the airplane used in the optical experiment, but
somewhat more complex. The output from a SAR sensor is a top-down view. By assuming that an airplane is parked
on the ground, it was only necessary to consider the possible appearances of the airplane corresponding to various
rotations about an axis perpendicular to the ground. Occlusions were not considered. Viewing directions were
sampled every 10°. Figure 24 shows the airplane model and the sample appearances.


Matching templates were generated based on the predicted features. Each template consisted of a collection of bright
lines and bright spots. Since range values can be determined from SAR images, it was possible to normalize the size
of the aircraft. Hence, variation in object size was not considered.

A test image of the airplane was generated using SARSIM. The first stage of the run-time process was aspect
classification, which reduced the number of possible aspects by matching partial templates. Figure 26 illustrates the
aspect classification process. The first match resulted in a broad likelihood distribution. After matching using all six
effective features, a narrower distribution was obtained, which reduced the number of possible directions to five.


Figure 26: Aspect classification stage for SAR experiment.
(a) Input SAR image. (b) Accumulated likelihood distributions. (c) Aircraft parts.


The second stage of the run-time process was verification, in which deformable templates were fit to each of
the candidate aspects. Figure 27 illustrates the verification process. The upper five images in the figure are copies of
the input image. The superimposed white figures show the deformed templates at the minimum energy levels. The
least energy of the five templates was obtained for the template at 30°. The result is shown in the lower image of the
figure, with the outline of the aircraft superimposed over the original image.


Figure 27: Verification stage for SAR experiment.
5 Summary

In this paper, we presented the paradigm of appearance-based vision, which is a paradigm for building object
recognition systems. The paradigm is called appearance-based because an integral step is the prediction and analysis of
object appearances. An appearance-based system is called a vision algorithm compiler, or VAC. The input to a VAC is a
set of object and sensor models, and the output is an executable object recognition program.

Appearance-based systems share four principal defining characteristics:

• two-stage process

A VAC operates in two distinct stages. The first stage is performed off-line, and consists of the analysis of
appearances and the generation of an object recognition program. The second stage is performed on-line,
and consists of the execution of the previously generated program.

• explicit object and sensor models

A VAC embodies an overall approach to object recognition that can be applied to a variety of objects and
sensors. Therefore, explicit and exchangeable models are utilized. Sensor models specify the features
detectable by the sensor, along with procedures to compute the detectability and reliability of each feature.
Object models include geometric and photometric properties of the objects.

• appearance prediction and analysis

Objects are recognized based on their appearances in images. Therefore, the prediction and analysis of
appearance is fundamental to generating competent object recognition programs. A VAC predicts
appearances based on the information in object and sensor models. The range of appearances may be determined
analytically or exhaustively, and the appearances may be predicted analytically or through image
synthesis.

• strategy generation

Each VAC may embody a different approach to object recognition. For example, the first VAC presented
processed range data and performed final pose determination using minimization of edge location errors.
The second VAC processed specular images and performed pose determination using deformable
templates. However, the output of a VAC is an executable program for object recognition. Typically, the
output program is optimized during the off-line stage; the optimization costs are paid back by cost-efficient
execution of the on-line stage.

The history of computer vision research has consisted largely of research devoted to making vision systems work. As
a result, powerful new methods have been developed and the vision systems of today are much more powerful and
competent than those of the past. However, vision systems of today are no easier or cheaper to build than systems in
the past. As a result, computer vision is not as widely applied as one might expect, due to the cost of systems.

Appearance-based vision provides one approach to making vision systems more cost-effective by providing a means
of automatically generating object recognition systems. Rather than requiring time and effort from highly trained
individuals, a VAC can generate a competent, cost-effective object recognition program given only object and sensor
models. The example VACs presented in this paper demonstrated both the effectiveness and the flexibility of
appearance-based vision.